Abstract. We consider the problem of online audio source separation. 
Existing algorithms adopt either a sliding block approach or a stochas- 
tic gradient approach, which is faster but less accurate. Also, they rely 
either on spatial cues or on spectral cues and cannot separate certain 
mixtures. In this paper, we design a general online audio source separa- 
tion framework that combines both approaches and both types of cues. 
The model parameters are estimated in the Maximum Likelihood (ML) 
sense using a Generalised Expectation Maximisation (GEM) algorithm 
with multiplicative updates. The separation performance is evaluated as 
a function of the block size and the step size and compared to that of an 
offline algorithm. 

Keywords: Online audio source separation, nonnegative matrix factori- 
sation, sliding block, stochastic gradient. 



1 Introduction 

Audio source separation is the process of recovering a set of audio signals from 
a given mixture signal. This can be addressed via established approaches such 
as Independent Component Analysis (ICA), binary masking and Sparse Com- 
ponent Analysis (SCA) [1] or more recent approaches such as local Gaussian 
modeling and Nonnegative Matrix Factorisation (NMF) [2]. Most current algorithms
are offline algorithms which require the whole signal in order to estimate
the sources. In this paper, we focus on online audio source separation, whereby 
only the past samples of the mixture are available. This constraint arises in 
particular in real-time scenarios. 

A few online implementations have been designed for ICA [3, 4], time-frequency
masking [5], local Gaussian modeling [6], spectral continuity-based
separation [7] and NMF [8]. However, these algorithms rely either on spatial
cues [3]-[6] or on spectral cues [7, 8] alone. Such algorithms are not capable
of separating mixtures where several sources have the same spatial position and
several sources have similar spectral characteristics. For example, in pop music,
the voice, the snare drum, the bass drum and the bass are often mixed to the
centre and several voices or several guitars are present.

In order to address this issue, we consider the general flexible source separation
framework in [9]. This framework generalises a wide range of algorithms
such as certain forms of ICA, local Gaussian modeling and NMF, and enables
the specification of additional constraints on the sources such as harmonicity.
By jointly exploiting spatial and spectral cues, it makes it possible to robustly
separate difficult mixtures such as those described above.

The two main approaches for online source separation are the sliding block
(also known as blockwise) approach, as used in [3, 4, 5, 7], and the stochastic
gradient (also known as stepwise) approach, as used in [6, 8]. The sliding block
method consists in applying the offline audio source separation algorithm to a
block of $M$ time frames. Once this block of signal has been processed, a frame
is extracted for each of the $J$ sources before sliding the processing block by one
frame. This approach is computationally intensive but accurate. The stepwise
method updates the model parameters in every frame using only the
latest available frame and the model parameters estimated in the previous frame.
As it uses only the latest available frame at a given time, this approach is faster
than the sliding block approach but can be inaccurate.
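
The difference between the two approaches comes down to which frames are visible at each time frame. The following minimal Python sketch illustrates this, assuming 0-based frame indices and blocks that are truncated at the start of the signal; the helper name `block_frames` is hypothetical and does not come from any of the cited algorithms.

```python
def block_frames(t, block_size):
    """Return the indices of the frames available at time frame t.

    block_size = 1 corresponds to the stepwise approach (latest frame only);
    a larger block_size corresponds to a sliding block ending at frame t.
    """
    if block_size < 1:
        raise ValueError("block_size must be at least 1")
    start = max(0, t - block_size + 1)  # truncate the block at the signal start
    return list(range(start, t + 1))


# Stepwise: only the latest frame is used.
stepwise = block_frames(5, 1)
# Sliding block of 4 frames ending at frame 5.
sliding = block_frames(5, 4)
# Early frames: the block is shorter than block_size.
early = block_frames(1, 4)
```

Sliding the block by one frame means that consecutive blocks share all but one frame, which is why the sliding block approach is the more costly of the two.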

In this paper, we propose a general iterative online algorithm for the source
separation framework in [9] that combines the sliding block approach and the
stepwise approach using two hyper-parameters: the block size $M$ and the step size
$\alpha$. As a by-product, we provide a way of circumventing the annealing procedure
in [9], which would require a large number of iterations per block. Moreover, we
determine the best trade-off between these two approaches experimentally on a
set of real-world music mixtures.

The rest of the paper is structured as follows. The flexible framework in
[9] is introduced in Section 2. Section 3 presents the online algorithm. Experimental
results are shown in Section 4. The conclusion can be found in Section 5.

2 General audio source separation framework 

We operate in the time-frequency (TF) domain by means of the Short-Time
Fourier Transform (STFT). In each frequency bin $f$ and each time frame $n$, the
multichannel mixture signal $\mathbf{x}(f,n)$ can be expressed as

$$\mathbf{x}(f,n) = \sum_{j=1}^{J} \mathbf{c}_j(f,n) \qquad (1)$$

where $J$ is the number of sources and $\mathbf{c}_j(f,n)$ is the STFT of the spatial image
of the $j$-th source.

2.1 Model

We assume that $\mathbf{c}_j(f,n)$ is a complex-valued Gaussian random vector with zero
mean and covariance matrix $\mathbf{R}_{c_j}(f,n)$

$$\mathbf{c}_j(f,n) \sim \mathcal{N}_c\left(\mathbf{0}, \mathbf{R}_{c_j}(f,n)\right) \qquad (2)$$

and that $\mathbf{R}_{c_j}(f,n)$ factors as

$$\mathbf{R}_{c_j}(f,n) = \mathbf{R}_j(f)\, v_j(f,n) \qquad (3)$$

where $\mathbf{R}_j(f)$ is the spatial covariance matrix of the $j$-th source and $v_j(f,n)$ is
its spectral variance.

In [9], $\mathbf{R}_j(f)$ is expressed as $\mathbf{R}_j(f) = \mathbf{A}_j(f)\mathbf{A}_j^H(f)$, and $\mathbf{A}_j(f)$ is estimated
instead. This results in an annealing procedure, which would translate into a
large number of iterations within each block in our context. In order to circumvent
the annealing, we assume that $\mathbf{R}_j(f)$ is full-rank and directly estimate
$\mathbf{R}_j(f)$ instead, similarly to [10].

The spectral variance $v_j(f,n)$ is modeled via a form of hierarchical NMF [9].
The matrix of spectral variances $\mathbf{V}_j = [v_j(f,n)]_{f,n}$ is first decomposed into the
product of an excitation spectral power $\mathbf{V}_j^x$ and a filter spectral power $\mathbf{V}_j^f$

$$\mathbf{V}_j = \mathbf{V}_j^x \odot \mathbf{V}_j^f \qquad (4)$$

where $\odot$ denotes entrywise multiplication. $\mathbf{V}_j^x$ is further decomposed into the
product of a matrix of narrowband spectral patterns $\mathbf{W}_j^x$, a matrix of spectral
envelope weights $\mathbf{U}_j^x$, a matrix of temporal envelope weights $\mathbf{G}_j^x$ and a matrix
of time-localised temporal patterns $\mathbf{H}_j^x$, so that

$$\mathbf{V}_j^x = \mathbf{W}_j^x \mathbf{U}_j^x \mathbf{G}_j^x \mathbf{H}_j^x. \qquad (5)$$

$\mathbf{V}_j^f$ is decomposed in a similar way.

This factorisation enables the specification of various spectral or temporal
constraints over the sources. For example, harmonicity can be enforced by fixing
$\mathbf{W}_j^x$ to a set of narrowband harmonic patterns.
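
To make the factorisation in (4) and (5) concrete, the following sketch builds a toy spectral variance matrix in plain Python. The matrix sizes and values are arbitrary illustrations, the helper names `matmul` and `hadamard` are hypothetical, and the filter part is assumed flat (all ones) purely to keep the example short.

```python
def matmul(a, b):
    """Plain matrix product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def hadamard(a, b):
    """Entrywise product of two matrices of equal size."""
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Toy excitation part: 3 frequency bins, 2 time frames.
W_x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # narrowband spectral patterns (3 x 2)
U_x = [[2.0], [1.0]]                         # spectral envelope weights (2 x 1)
G_x = [[1.0, 3.0]]                           # temporal envelope weights (1 x 2)
H_x = [[1.0, 0.0], [0.0, 1.0]]               # time-localised temporal patterns (2 x 2)

# Equation (5): V^x = W^x U^x G^x H^x
V_x = matmul(matmul(matmul(W_x, U_x), G_x), H_x)

# Filter part, assumed flat here so that it leaves V^x unchanged.
V_f = [[1.0, 1.0] for _ in range(3)]

# Equation (4): V = V^x (entrywise product) V^f
V = hadamard(V_x, V_f)
```

Fixing `W_x` while updating the other factors is the kind of constraint referred to above: the spectral fine structure is imposed, and only its envelope and temporal activation are learned.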

2.2 Offline EM-MU algorithm 

In an offline context, the model parameters are estimated in the Maximum
Likelihood (ML) sense by a Generalised Expectation-Maximisation (GEM)
algorithm combined with Multiplicative Updates (MU) applied to the complete data
$\{\mathbf{c}_j(f,n)\}$.

The log-likelihood is defined using the empirical mixture covariance matrix
$\widehat{\mathbf{R}}_x(f,n)$ [10] as

$$\log \mathcal{L} = \sum_{f,n} -\mathrm{tr}\left(\mathbf{R}_x^{-1}(f,n)\, \widehat{\mathbf{R}}_x(f,n)\right) - \log\det\left(\pi \mathbf{R}_x(f,n)\right) \qquad (6)$$

where

$$\mathbf{R}_x(f,n) = \sum_{j=1}^{J} \mathbf{R}_{c_j}(f,n) \qquad (7)$$

is the covariance of the mixture $\mathbf{x}(f,n)$.
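
As a sanity check on (6), consider one TF bin of a single-channel mixture ($I = 1$), an assumption made here only to keep the example scalar. The per-bin term then reduces to $-\hat{r}/r - \log(\pi r)$, which is maximised when the model variance $r$ equals the empirical variance $\hat{r}$. The function name below is hypothetical.

```python
import math


def bin_log_likelihood(model_var, empirical_var):
    """Per-bin term of (6) for a single-channel mixture (scalar covariances)."""
    return -empirical_var / model_var - math.log(math.pi * model_var)


empirical = 2.0
at_match = bin_log_likelihood(empirical, empirical)         # model equals data
too_small = bin_log_likelihood(0.5 * empirical, empirical)  # model underestimates
too_large = bin_log_likelihood(2.0 * empirical, empirical)  # model overestimates
```

Both mismatched models score lower than the matched one, which is the behaviour the EM iterations exploit when fitting the model covariance to the empirical one.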
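
In the E-step, the expectation of the natural statistics is computed via [10]

$$\boldsymbol{\Omega}_j(f,n) = \mathbf{R}_{c_j}(f,n)\, \mathbf{R}_x^{-1}(f,n) \qquad (8)$$
$$\widehat{\mathbf{R}}_{c_j}(f,n) = \boldsymbol{\Omega}_j(f,n)\, \widehat{\mathbf{R}}_x(f,n)\, \boldsymbol{\Omega}_j^H(f,n) + \left(\mathbf{I} - \boldsymbol{\Omega}_j(f,n)\right) \mathbf{R}_{c_j}(f,n) \qquad (9)$$

where $\boldsymbol{\Omega}_j$ is the Wiener filter, $\mathbf{I}$ is the $I \times I$ identity matrix and $I$ is the number
of channels of the mixture.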
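
A minimal sketch of (8) and (9) follows, assuming a single-channel mixture ($I = 1$) so that all covariances are real scalars and the Hermitian transpose has no effect. The function name and the numerical values are illustrative only; in the multichannel case the division becomes a matrix inversion.

```python
def e_step_bin(source_vars, empirical_mix_var):
    """E-step of (8)-(9) in one TF bin for scalar (single-channel) covariances.

    source_vars: model variances of the J source images in this bin.
    empirical_mix_var: empirical mixture variance in this bin.
    Returns the Wiener gains and the posterior source variances.
    """
    mix_var = sum(source_vars)                              # equation (7)
    gains = [v / mix_var for v in source_vars]              # equation (8)
    posterior = [g * empirical_mix_var * g + (1.0 - g) * v  # equation (9)
                 for g, v in zip(gains, source_vars)]
    return gains, posterior


# The mixture is louder than the model predicts (8.0 versus 1.0 + 3.0).
gains, posterior = e_step_bin([1.0, 3.0], 8.0)

# When the empirical variance matches the model, the variances are unchanged.
_, unchanged = e_step_bin([1.0, 3.0], 4.0)
```

In this example the posterior variances (1.25 and 5.25) exceed the prior ones because the observed mixture power is larger than the model predicts.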

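In the M-step, the model parameters are updated as [9, 10]

$$\mathbf{R}_j(f) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{v_j(f,n)}\, \widehat{\mathbf{R}}_{c_j}(f,n) \qquad (10)$$

$$\mathbf{W}_j^x = \mathbf{W}_j^x \odot \frac{\left[\boldsymbol{\Xi}_j \odot \mathbf{V}_j^{x.-2} \odot \mathbf{V}_j^{f.-1}\right] \left(\mathbf{U}_j^x \mathbf{G}_j^x \mathbf{H}_j^x\right)^T}{\mathbf{V}_j^{x.-1} \left(\mathbf{U}_j^x \mathbf{G}_j^x \mathbf{H}_j^x\right)^T} \qquad (11)$$

$$\mathbf{U}_j^x = \mathbf{U}_j^x \odot \frac{\mathbf{W}_j^{xT} \left[\boldsymbol{\Xi}_j \odot \mathbf{V}_j^{x.-2} \odot \mathbf{V}_j^{f.-1}\right] \left(\mathbf{G}_j^x \mathbf{H}_j^x\right)^T}{\mathbf{W}_j^{xT}\, \mathbf{V}_j^{x.-1} \left(\mathbf{G}_j^x \mathbf{H}_j^x\right)^T} \qquad (12)$$

$$\mathbf{G}_j^x = \mathbf{G}_j^x \odot \frac{\left(\mathbf{W}_j^x \mathbf{U}_j^x\right)^T \left[\boldsymbol{\Xi}_j \odot \mathbf{V}_j^{x.-2} \odot \mathbf{V}_j^{f.-1}\right] \mathbf{H}_j^{xT}}{\left(\mathbf{W}_j^x \mathbf{U}_j^x\right)^T \mathbf{V}_j^{x.-1}\, \mathbf{H}_j^{xT}} \qquad (13)$$

$$\mathbf{H}_j^x = \mathbf{H}_j^x \odot \frac{\left(\mathbf{W}_j^x \mathbf{U}_j^x \mathbf{G}_j^x\right)^T \left[\boldsymbol{\Xi}_j \odot \mathbf{V}_j^{x.-2} \odot \mathbf{V}_j^{f.-1}\right]}{\left(\mathbf{W}_j^x \mathbf{U}_j^x \mathbf{G}_j^x\right)^T \mathbf{V}_j^{x.-1}} \qquad (14)$$

where $.^{p}$ denotes entrywise raising to the power $p$, $N$ is the number of time
frames in the STFT of the signal and $\boldsymbol{\Xi}_j = [\xi_j(f,n)]_{f,n}$ with

$$\xi_j(f,n) = \frac{1}{I}\, \mathrm{tr}\left(\mathbf{R}_j^{-1}(f)\, \widehat{\mathbf{R}}_{c_j}(f,n)\right). \qquad (15)$$

$\mathbf{W}_j^f$, $\mathbf{U}_j^f$, $\mathbf{G}_j^f$ and $\mathbf{H}_j^f$ are updated in a similar way.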
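
The multiplicative update of the spectral patterns in (11) can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the product $\mathbf{U}_j^x \mathbf{G}_j^x \mathbf{H}_j^x$ is assumed precomputed and passed as a single matrix `UGH`, the fraction is taken entrywise, and all function names are hypothetical.

```python
def matmul(a, b):
    """Plain matrix product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    """Matrix transpose."""
    return [list(row) for row in zip(*a)]


def update_patterns(W, UGH, Xi, V_f):
    """One multiplicative update of W^x following (11)."""
    V_x = matmul(W, UGH)
    # Xi (entrywise) V_x^{.-2} (entrywise) V_f^{.-1}
    weight = [[xi / (vx * vx * vf) for xi, vx, vf in zip(r_xi, r_vx, r_vf)]
              for r_xi, r_vx, r_vf in zip(Xi, V_x, V_f)]
    inv_vx = [[1.0 / vx for vx in row] for row in V_x]
    UGH_T = transpose(UGH)
    num = matmul(weight, UGH_T)
    den = matmul(inv_vx, UGH_T)
    # Entrywise product and entrywise division keep W nonnegative.
    return [[w * n / d for w, n, d in zip(rw, rn, rd)]
            for rw, rn, rd in zip(W, num, den)]


W = [[1.0, 2.0], [3.0, 1.0]]                 # 2 bins x 2 patterns
UGH = [[1.0, 0.5, 2.0], [2.0, 1.0, 1.0]]     # 2 patterns x 3 frames
V_f = [[2.0, 2.0, 2.0], [2.0, 2.0, 2.0]]     # flat filter part
V_x = matmul(W, UGH)

# If the statistics already match the model variance, W is a fixed point.
Xi_exact = [[vx * vf for vx, vf in zip(rx, rf)] for rx, rf in zip(V_x, V_f)]
W_same = update_patterns(W, UGH, Xi_exact, V_f)

# If the statistics are twice the model variance, W is scaled up by two.
Xi_double = [[2.0 * xi for xi in row] for row in Xi_exact]
W_doubled = update_patterns(W, UGH, Xi_double, V_f)
```

The two checks illustrate why the update is well behaved: it leaves a perfectly fitting model untouched and rescales the patterns when the model is uniformly too quiet.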

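After each EM iteration, the model parameters are normalised: the means of
$\mathbf{R}_j$, $\mathbf{W}_j^x$, $\mathbf{U}_j^x$, $\mathbf{G}_j^x$, $\mathbf{H}_j^x$, $\mathbf{W}_j^f$, $\mathbf{U}_j^f$ and $\mathbf{H}_j^f$ are normalised to 1 while $\mathbf{G}_j^f$ is multiplied
by the product of the normalisation factors of the other variables.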

The separated sources are then obtained via

$$\widehat{\mathbf{c}}_j(f,n) = \boldsymbol{\Omega}_j(f,n)\, \mathbf{x}(f,n). \qquad (16)$$
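
One consequence of (7), (8) and (16) is worth spelling out. It is not stated explicitly above, but it follows directly from the definitions: the estimated source images add up to the mixture.

```latex
\sum_{j=1}^{J} \widehat{\mathbf{c}}_j(f,n)
  = \sum_{j=1}^{J} \boldsymbol{\Omega}_j(f,n)\,\mathbf{x}(f,n)
  = \Big(\sum_{j=1}^{J} \mathbf{R}_{c_j}(f,n)\Big)\mathbf{R}_x^{-1}(f,n)\,\mathbf{x}(f,n)
  = \mathbf{R}_x(f,n)\,\mathbf{R}_x^{-1}(f,n)\,\mathbf{x}(f,n)
  = \mathbf{x}(f,n).
```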



3 Online EM-MU algorithm 

We now consider an online context where in each time frame $t$, the data is
limited to a block of $M$ STFT frames indexed by $n$ with $t - M + 1 \le n \le t$,
where $M = 1$ for the stepwise approach and $M$ covers the whole signal for the full
offline approach. We define a step size coefficient $\alpha \in\, ]0; 1]$ to stabilise the
parameter updates by averaging over time. For each block, the spatial covariance
matrices $\mathbf{R}_j^{(t)}(f)$ are initialised to a diffuse spatial covariance spanning a part of
the audio space. The temporal weights $\mathbf{G}_j^{x(t)}$ are randomly initialised and then
normalised to the mean spectral power of the signal. Finally, the temporal patterns
$\mathbf{H}_j^{x(t)}$ are initialised to diagonal matrices. The expectation of the natural
statistics is computed using (8) and (9) for $t - M + 1 \le n \le t$, whilst the spatial
covariance matrix is updated as follows:

$$\mathbf{R}_j^{(t)}(f) = (1 - \alpha)\, \mathbf{R}_j^{(t-1)}(f) + \frac{\alpha}{M} \sum_{n=t-M+1}^{t} \frac{1}{v_j(f,n)}\, \widehat{\mathbf{R}}_{c_j}(f,n) \qquad (17)$$

where the superscript $(t)$ denotes the value of the matrix for block $t$.
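
The recursive update of the spatial covariance in (17) can be sketched as follows for one source and one frequency bin. The sketch assumes that the block term is normalised by the block length $M$, by analogy with the offline update (10); the function name and the toy values are hypothetical.

```python
def update_spatial_covariance(prev_R, post_covs, variances, alpha):
    """Recursive update of one spatial covariance matrix following (17).

    prev_R: I x I matrix estimated for the previous block.
    post_covs: list of M posterior covariance matrices from (9), one per frame.
    variances: list of the M spectral variances of the block.
    alpha: step size in ]0, 1].
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in ]0, 1]")
    size = len(prev_R)
    block_len = len(post_covs)
    new_R = []
    for r in range(size):
        row = []
        for c in range(size):
            # Average of the variance-normalised posterior covariances.
            block_mean = sum(cov[r][c] / v
                             for cov, v in zip(post_covs, variances)) / block_len
            row.append((1.0 - alpha) * prev_R[r][c] + alpha * block_mean)
        new_R.append(row)
    return new_R


prev_R = [[1.0, 0.0], [0.0, 1.0]]
post_covs = [[[2.0, 1.0], [1.0, 2.0]], [[4.0, 0.0], [0.0, 4.0]]]
variances = [1.0, 2.0]

# alpha = 1: the previous estimate is discarded (pure sliding block).
R_block_only = update_spatial_covariance(prev_R, post_covs, variances, 1.0)
# alpha = 0.5: equal weight on the previous estimate and the current block.
R_smoothed = update_spatial_covariance(prev_R, post_covs, variances, 0.5)
```

Small values of `alpha` give the previous estimate more weight, which smooths the updates but also lets past estimation errors persist over many blocks.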

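$\mathbf{G}_j^{x(t)}$ and $\mathbf{H}_j^{x(t)}$ are updated using (13) and (14) for $t - M + 1 \le n \le t$, as
they are expected to significantly vary between blocks, whereas the updates of
$\mathbf{W}_j^{x(t)}$ and $\mathbf{U}_j^{x(t)}$ become

$$\mathbf{W}_j^{x(t)} = \mathbf{W}_j^{x(t)} \odot \frac{\mathbf{M}_j^{x(t)}}{\mathbf{C}_j^{x(t)}} \qquad (18)$$

$$\mathbf{U}_j^{x(t)} = \mathbf{U}_j^{x(t)} \odot \frac{\mathbf{N}_j^{x(t)}}{\mathbf{D}_j^{x(t)}} \qquad (19)$$

where

$$\mathbf{M}_j^{x(t)} = (1 - \alpha)\, \mathbf{M}_j^{x(t-1)} + \alpha \left[\boldsymbol{\Xi}_j \odot \mathbf{V}_j^{x.-2} \odot \mathbf{V}_j^{f.-1}\right] \left(\mathbf{U}_j^{x(t)} \mathbf{G}_j^{x(t)} \mathbf{H}_j^{x(t)}\right)^T \qquad (20)$$

$$\mathbf{C}_j^{x(t)} = (1 - \alpha)\, \mathbf{C}_j^{x(t-1)} + \alpha\, \mathbf{V}_j^{x.-1} \left(\mathbf{U}_j^{x(t)} \mathbf{G}_j^{x(t)} \mathbf{H}_j^{x(t)}\right)^T \qquad (21)$$

$$\mathbf{N}_j^{x(t)} = (1 - \alpha)\, \mathbf{N}_j^{x(t-1)} + \alpha\, \mathbf{W}_j^{x(t)T} \left[\boldsymbol{\Xi}_j \odot \mathbf{V}_j^{x.-2} \odot \mathbf{V}_j^{f.-1}\right] \left(\mathbf{G}_j^{x(t)} \mathbf{H}_j^{x(t)}\right)^T \qquad (22)$$

$$\mathbf{D}_j^{x(t)} = (1 - \alpha)\, \mathbf{D}_j^{x(t-1)} + \alpha\, \mathbf{W}_j^{x(t)T}\, \mathbf{V}_j^{x.-1} \left(\mathbf{G}_j^{x(t)} \mathbf{H}_j^{x(t)}\right)^T \qquad (23)$$

where $\boldsymbol{\Xi}_j$ is computed as in (15). $\mathbf{M}_j^{f(t)}$, $\mathbf{C}_j^{f(t)}$, $\mathbf{N}_j^{f(t)}$ and $\mathbf{D}_j^{f(t)}$ are updated
in a similar way. At each block, several iterations can be performed in order to
improve the estimation of the model parameters.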
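
The role of the accumulated statistics in (18), (20) and (21) can be illustrated with a short sketch. It assumes that the current-block terms, i.e. the matrix products weighted by $\alpha$ in (20) and (21), have already been computed and are passed in as `M_block` and `C_block`; all names and values are hypothetical.

```python
def smooth(previous, current, alpha):
    """Entrywise recursive average (1 - alpha) * previous + alpha * current."""
    return [[(1.0 - alpha) * p + alpha * c for p, c in zip(rp, rc)]
            for rp, rc in zip(previous, current)]


def update_with_statistics(W, numerator_stat, denominator_stat):
    """Multiplicative update W <- W * M / C (entrywise), as in (18)."""
    return [[w * m / c for w, m, c in zip(rw, rm, rc)]
            for rw, rm, rc in zip(W, numerator_stat, denominator_stat)]


# Statistics carried over from block t - 1 and computed on block t.
M_prev, C_prev = [[4.0, 2.0]], [[2.0, 2.0]]
M_block, C_block = [[8.0, 2.0]], [[2.0, 4.0]]
W = [[1.0, 1.0]]

alpha = 0.5
M_t = smooth(M_prev, M_block, alpha)  # equation (20)
C_t = smooth(C_prev, C_block, alpha)  # equation (21)
W_t = update_with_statistics(W, M_t, C_t)

# With alpha = 1 the past statistics are discarded entirely.
W_block_only = update_with_statistics(
    W, smooth(M_prev, M_block, 1.0), smooth(C_prev, C_block, 1.0))
```

Averaging the numerator and the denominator separately, rather than averaging the updated parameters, is what distinguishes this scheme from simply smoothing the outputs of successive blocks.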

Although equations (17) to (19) look similar to the online update of the local
Gaussian model in [6] and [8], there are two crucial differences:

— The framework introduced in the current paper is more general in the sense 
that it uses hierarchical NMF, enabling the user to apply more specific con- 
straints than when using shallow NMF. 

— It is not limited to the sole use of the latest audio frame. 



4 Experimental results 

We compared the performance of the online audio source separation framework
to the offline framework introduced in Section 2.2 as a function of the number of
EM iterations, $\alpha$ and $M$. As the project aims at the remixing of recordings for sound
engineers, DJs and consumers, we processed five 10 s long stereo commercial
pop recordings composed of bass, drums, guitars, strings and voice. All the
recordings were sampled at 44100 Hz. The STFT was computed using half-overlapping
2048 sample sine windows. In the offline algorithm as well as in the
online algorithm, each of the modeled sources was constrained in a way similar
to Section V.C in [9]. In the case of a harmonic source, $\mathbf{W}_j^{x(t)}$ was fixed to a
set of narrowband harmonic spectral patterns and the spectral envelope weights
in $\mathbf{U}_j^{x(t)}$ were updated, whereas for bass and percussive sources, $\mathbf{W}_j^{x(t)}$ was a
fixed diagonal matrix and $\mathbf{U}_j^{x(t)}$ was a fixed matrix of basis spectra learned over
a corpus of bass and drum sounds.
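
The STFT settings quoted above (44100 Hz, half-overlapping 2048 sample sine windows) can be made concrete with a short sketch. The exact window definition is not given in the text, so the half-sample-offset sine window below is an assumption, as are the variable names.

```python
import math

SAMPLE_RATE = 44100        # Hz, as stated above
WINDOW_LENGTH = 2048       # samples
HOP = WINDOW_LENGTH // 2   # half-overlapping windows


def sine_window(length):
    """Sine window w[n] = sin(pi * (n + 0.5) / length) (one common definition)."""
    return [math.sin(math.pi * (n + 0.5) / length) for n in range(length)]


window = sine_window(WINDOW_LENGTH)

# Squared windows shifted by half a window sum to one, so using the same
# window for analysis and synthesis allows perfect overlap-add reconstruction.
overlap_sum = [window[n] ** 2 + window[n + HOP] ** 2 for n in range(HOP)]

window_duration_ms = 1000.0 * WINDOW_LENGTH / SAMPLE_RATE
hop_duration_ms = 1000.0 * HOP / SAMPLE_RATE
```

With these settings each frame spans about 46 ms and a new frame arrives about every 23 ms, which gives a sense of the time scale of a block of $M$ frames.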

Audio samples of the separated sounds of this experiment can be found at
http://www.irisa.fr/metiss/lssimon/LVA2012/index.html.

Separation performance was evaluated using the Signal-to-Distortion Ratio
(SDR), the Signal-to-Interference Ratio (SIR), the source Image to Spatial distortion
Ratio (ISR) and the Source-to-Artifacts Ratio (SAR) defined in [11]. For
each set of conditions over the number of iterations, $M$ and $\alpha$, each of these
criteria was averaged over all the mixtures and all the separated sound sources.
Over all the results of this experiment, the SDR varied between -1.1 and 0.9 dB, 
the SIR between -4 and 1 dB, the ISR between 2.3 and 3.9 dB and the SAR 
between 10 and 19 dB. 



Table 1. Separation performance (dB) of the offline and best online algorithms.

Algorithm   $\alpha$   $M$   Number of iterations   SDR      SIR      ISR      SAR
offline     N/A        N/A   100                    0.8586   1.2837   3.7989   13.3872
online      1          50    30                     0.8671   1.0675   3.9690   12.3278


As shown in Table 1, when $\alpha = 1$, $M = 50$ and 30 GEM iterations are
performed, the separation performance of the online algorithm is close to that
of the offline algorithm. For a smaller block size and a smaller number of iterations,
the performance decreases. For example, for $M = 10$ and 6 GEM iterations, the
SDR is 0.53 dB and the SIR is 3.53 dB. More generally, Fig. 1 shows that for
$\alpha = 1$, increasing either the block size or the number of iterations increases
the SDR, though the block size has less effect on the SDR than the number of
iterations. The results also show that increasing the number of iterations from
10 to 30 increases the SDR by 0.2 dB, which can be considered as a significant
improvement.

When $\alpha < 1$, the SDR decreases significantly, as can be seen in Fig. 1. It can
also be seen that increasing the number of iterations decreases the SDR and 
changes of block size have little to no effect on the SDR. This can be explained 
by an inaccurate estimation of the model parameters of certain sources in the 
time intervals when these sources are inactive. These inaccurate parameters are 
then carried over subsequent time frames and may not converge back to accurate 
values. This undesirable effect is particularly salient for those parameters that are 
less constrained. For instance, with the considered model, the spatial covariance 
matrices of all sources gradually diverge towards a diffuse spatial covariance 



A General Framework for Online Audio Source Separation 7 

spanning all directions in the mixture, while the effect is more limited for spectral 
parameters which are fixed or heavily constrained. Potential solutions to this 
problem are presented in the conclusion. 



[Figure 1: three panels for $\alpha = 0.02$, $\alpha = 0.05$ and $\alpha = 0.1$; horizontal axis: number of iterations (2, 6, 10, 30).]

Fig. 1. Mean SDR for all sources and all mixtures, as a function of $\alpha$, $M$ and the number of iterations.



5 Conclusion 

In this paper, a new framework for online audio source separation was presented. 
This algorithm offers an increased flexibility both in terms of the range of con- 
straints that can be specified for each source and of the choice of a trade-off 
between separation accuracy and computational cost. It was shown that the 
separation accuracy is higher when the block size is large, but that small block 
sizes nevertheless offer an acceptable separation. However, small step sizes cause 
the spatial covariance matrices to diverge due to the presence of silence intervals 
in the sources. 

This issue is well-known in the beamforming literature where a voice activity
detector is used to restrict the time frames in which the model parameters
are updated [12]. While this solution does not readily extend to source separation,
we believe that there exist a number of promising alternative solutions,
e.g. adding soft constraints over the least constrained parameters by means of
probabilistic priors, using different step sizes for the most constrained and the
least constrained parameters, and using signal-dependent step sizes related to
the power of $\mathbf{R}_{c_j}(f,n)$ such that the parameters are not updated in the time
intervals with low power.
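
The last of these suggestions can be illustrated with a deliberately simple sketch. It is not a method proposed or evaluated in this paper: the hard threshold, the function names and the numerical values are all hypothetical, and a practical rule would probably vary the step size smoothly with the power.

```python
def gated_step_size(base_alpha, source_power, power_floor):
    """Return base_alpha when the source is active, 0 otherwise.

    Hypothetical rule: a step size of 0 freezes the recursive update
    (1 - alpha) * previous + alpha * current at its previous value.
    """
    return base_alpha if source_power >= power_floor else 0.0


def recursive_update(previous, current, alpha):
    """Scalar recursive average used to illustrate the effect of alpha."""
    return (1.0 - alpha) * previous + alpha * current


previous_estimate = 2.0
# Active frame: the estimate moves towards the new observation.
active = recursive_update(previous_estimate, 4.0,
                          gated_step_size(0.1, 1.0, 1e-3))
# Silent frame: the estimate is left untouched.
silent = recursive_update(previous_estimate, 0.0,
                          gated_step_size(0.1, 1e-6, 1e-3))
```

Freezing the update during silence prevents the parameters of an inactive source from drifting, which is the failure mode described in Section 4.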

Future work should also include an optimisation of the initialisation of the 
model parameters for each new block. After these improvements, we expect that 
the proposed framework will reach its full potential and provide a better trade-off 
between separation performance and computational cost. 